Story of Minister Officials Travel Expenses

Introduction

From the three data sets, “Travel Expenses issued by Minister officials,” is interesting for me to do the exploratory data analysis. The use of information in the data set has Open Government License – Canada. This dataset consolidates all the travel expense reports submitted by federal institutions.

I would like to analyse the impact of the travel expenses for the total cost, organisational impact for the travel, and the organisations which has highest travel expenses. The reason why I would like to base my analysis in expenses is to see which category of expenses such as airfare, other transport, lodging, or meals which would affect the total cost of expenses. Also, I would like to see any travel patterns throughout the years to find out the reasoning for the fluctuations, whether it is due to number of organisations that has increased yearly or if there are particular organisations directly impacted by the travel expenses.

Proactive Disclosure of Travel expenses has two csv data sets,

  1. Proactive disclosure - Travel Expenses
  2. Proactive disclosure - Travel expenses Nothing to report

I will be considering only the proactive disclosure - travel expenses dataset. There is a data dictionary-Travel which talks about how data has been entered, optional columns, mandatory columns, and definitions of the column themes.

Proactive Disclosure - travel Expenses dataset consists of 21 columns, 12 columns that are mandatory and 9 that are optional. The data has been collected from 2003 December to 2021 March. Ten columns such as Title in English, Travel start date, Airfare, Other transportation, Lodging, Meals, Other Expenses, Total and owner organisation would interest me to achieve my goals. The tolal 11 columns are not considered. Among those, most columns are in French and the end date, person name, destination, and travel description columns are not required in my analysis.

Importing Data

Investigation about the data

As the NaN values less than 10% of the data , for the analysis NaN values will be replaces by zeros

Above infomation verifies null values have been replaced by zeros.

As in the EDA there are values in the bdataframe entered as negative values.Therefore assumed those values have been entered incorrectly and the whole dataframe going to be filtered for the less than zero values

Data Analysis

According to the Exploratory data analysis, it shows from 2017 to 2019 huge increase of travel. Therefore the data analysis going to be based on the 2016 to 2019 to find the reasons for the sudden increase.

First considering the data from year 2019, the highest travel expenses egsists among all the years form 2003 to 2021

Analysis of Data 2019

Relationship between owner organisations travels

To get an overall pitcture of the owner organisation travel within the year 2019, point map is used to plot total of expenses verses start date

Most of the travels associated with less than $5000 in total. Lets filter the data above 4000 in total

To get the total expense for travel per organisation in year 2019,data is grouped by owner organisation

Why dnd-mdn has highest expenses?

To find the answer to the following question, lets analyse the travel counts of top ten owner organisations

Filtering associated data for top ten owner organisations

When comparing travel counts ic and dnd-mdn has considerable high counts than the rest of the organisations. Also there is no significant difference of travel counts among ic and dnd-mdn. But considering total expenses dnd_mdn spent more than 50% of ic.

Lets introduce new column as level of expenses, the total of expenses equal or above refered to as 'high' and the total of expenses below as 'low'

Above graph gives an answer to the question, why dnd-mdn has highest travel expenses. Compare to all other nine organisations dnd-mdn has the highest number of travels and 30% of the travels categorized as high. Eventhough ic and dnd-mdn both have approximately equal numner travels but 10% of total travels represent as high in ic. Therefore it is better to analyse why the 30% of the dnd-mdn travels falls under high level of expenses

Why 30% of dnd-mdn total travels falls into high level of expenses in year 2019

Lets filter total of expenses above 3000 in 2019

Lets see the impact of travel expenses to the total

Above faceted plots indicates, the total of expenses mostly impacted by airfare

How does titles affect for the travel expenses

As travel expenses data represent in very large range for the analysis of titles only consider the total above and equal to 10000

When considering the total expenses veriation in titles, the above graph shows minister of national defense among the top. And the other all titles also related to the execative staff and all are overseas travel

Does airfare strongly affect for the increase of travel expenses from 2016 to 2018

First lets see the travel pattern of dnd-mdn travel pattern from 2003 to 2021

This indicates dnd-mdn has started from year 2015 and gradulally increase its number of travels and reduced in 2020 onwards

Does dnd-mdn is the cause for the travel expenses increase from 2016 to 2018 as well

Lets concentrate the travel cost above 3000

Lets melt the above dataframe to findout top 10 owner organisations

Lets filter the associated data to the top 10 owner organisations and findout the counts of travels by counting entries of the owner organisation because each entry equal to travel

Above counts of travel indicate the esdc-edsc has the highest number of travels form 2016 to 2018

It is not necessarily to be high in total expenses when count of travels are high because if all the travels are related within the Canada, htere is a possibility to result lower total expenses. Therefore for further conformation lets see how the total expenses vary among the top owner organisations and lets see esdc-edsc has the highest total expenses.

Here there is an exception has done because when compare the top 10 organisations in 2019 and the top ten in the period of 2016 to 2018, all the organisations are same except the organisation ic, prominet in 2019. Therefore for the top10 organisations from the year2016 to 2019 the owner organisation ic also has been included for the total expenses analysis

Discussion

My aim was to findout the reason for the sudden increase of the travel expenses from 2016 to 2019. When comparing the three years, 2019 travel expenses was so high. Therefore, I first concentrated the data associated to the year 2019. From the first step, "Relationship between organisation travels," identified most of the travel expenses less than 5000. From the plot, "Veriation of total Expenses," identified dnd-mdn was the main contributer for the travel expenses. When compared with the plot, " Expenses above 4000, " it shows dnd-mdn has considerable amount of expensive travels.

The second step quantitatively analysed the travels among owner organisations for the questiona: Why dnd-mdn has the highest expenses? The top 10 organisations for travel expenses have been subjected for the analysis. When comparing the count of travels dnd-mdn shows highest count but there is no significant difference compared to ic. After categorising the travels which cost 4000 above as "high" and below as "low", dnd-mdn shows highest number of travels in the high category compared to all top 10 organisations. The analysis confirmed dnd-mdn is the main contributer for the travel increase in the year 2019.

Next, the curiosity was extended to see why the dnd-mdn has expensive travels compare to the other organisations. In this step, considered expenses associated for the total cost. The faceted bar plots clearly shows airfare was the main contributer to the total expenses. Next another question came up, whose responsible for these travels, for this titles were considered, those who spent more than 10000 for thier travels. Minister of National Defense came first in the plot "Titles responsible for expensive travels" and other titles represents the executive category and all these travels associated with overseas travels, which makes airfare appear the most impacted by the total cost of travel in 2019.

The story of travel expenses is not yet completed because exploratory data analysis shows a period of time from year 2016 to 2019 where a sudden increase of travel expenses occured. Next, data from 2016 to 2019 were considered as top ten organisations that have been identified and obtained counts for travels in each organisation yearly. The esdc-edsc shows the highest count for travels in the plot "Count of travels".Reason for this is dnd-mdm has started his travels according to the data from 2015 and it has increased gradually. But when compared with the total of expenses dnd-mdn is the highest and it is more than 50% higher than any of other organisations. According to this data, its confirmed that the dnd-mdn travels main contributer to the travel expenses increased in the peried of 2016 to 2019.

It is a great work that has been done by the ministry officials disclosing their travel expenses to reflect the transperancy. During the analysis, a question came up to my mind: Why the dnd-mdn disclose their travel expenses since year 2015? Does that mean no travels would have occured prior to 2015?

According to the data which has been disclosed, dnd-mdn show gradual increase of travel expenses from 2015 to 2019 and a decrease in 2020 due to the pendemic travel band. Even though year by year shows tremendous increase of travel expenses it is not possible to come to a conclusion due to many reasons: entry errors - We have noticed few data points have been entered as negative values which is impossible to be negative, had to drop such data points,there could be entry errors which we have not noticed. Missing information also could have been possible among organisations.

Therefore, it is essential to improve the quality of data by avoiding the free text columns as much as possible. For example, if title and destination column would be replaced by the dropdown menu instead of free text we can improved the quality of data in great extent.

Therefore all the conclusions are made above based on the data disclosed and it is necessary to do more reseach, analysis, and improve on how to do proper conclusions.

Referances

  1. Customizing Visualizations https://altair-viz.github.io/user_guide/customization.html
  2. Comprehensive Guide to Grouping and Aggregating with Pandas https://pbpython.com/groupby-agg.html
  3. Data visualization Modules from 1-7

completed by: Pamodya Siriwardana Date: 2021-05-22